The Performance Paradox states that a mathematically perfect kernel, such as $out = x + y$, can perform worse than a plain CPU loop if it fails to amortize the fixed costs of dispatching work to the GPU. This overhead often manifests as the Launch Tax.
1. The "Correctness" Fallacy
Functional correctness is not a proxy for efficiency. Your Triton code might correctly distribute work across thousands of threads, but if the total amount of work (N) is small, the GPU remains underutilized: the hardware spends more time on launch and scheduling overhead than on actual arithmetic.
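A toy occupancy model makes the underutilization concrete. The numbers below are assumptions for illustration (roughly A100-class: ~108 SMs, each wanting on the order of 2,048 resident threads to hide memory latency), not measured values; the point is only that a small N cannot fill the machine, no matter how correct the kernel is.

```python
# Toy occupancy model with assumed hardware numbers (illustrative only).
SMS = 108                 # streaming multiprocessors on the device
THREADS_PER_SM = 2048     # resident threads needed per SM to hide latency
SATURATION = SMS * THREADS_PER_SM   # ~221k elements for 1 thread per element

def utilization(n):
    """Fraction of the GPU's thread capacity a size-n element-wise op can fill."""
    return min(1.0, n / SATURATION)

for n in (1_000, 100_000, 10_000_000):
    print(f"N={n:>10} -> {utilization(n):6.1%} of thread capacity used")
```

Under these assumed numbers, a 1,000-element add occupies well under 1% of the machine, while a 10-million-element add saturates it.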
2. The Python Measurement Trap
Benchmarking GPU code from Python using time.time() is dangerous. GPU calls are asynchronous: Python merely enqueues the command and moves on. Without torch.cuda.synchronize(), you measure only the enqueue time, not the kernel. With synchronization, the measurement includes the launch overhead (host-side dispatch and host-to-device latency), which for small kernels is often 10x longer than the kernel execution itself.
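The trap can be reproduced without a GPU. The sketch below uses a single-worker thread pool as a hypothetical stand-in for a CUDA stream: submitting work returns immediately, so a naive timer sees almost nothing, while a timer that blocks until completion (the analogue of torch.cuda.synchronize()) sees the real cost. The 50 ms "kernel" duration is an arbitrary placeholder.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a CUDA stream: submitted work runs
# asynchronously on another thread, just as kernel launches do on a GPU.
stream = ThreadPoolExecutor(max_workers=1)

def fake_kernel():
    time.sleep(0.05)  # pretend this is 50 ms of GPU work

# Naive timing: measures only how long the *enqueue* takes.
t0 = time.perf_counter()
fut = stream.submit(fake_kernel)  # returns immediately; work is not done yet
naive_ms = (time.perf_counter() - t0) * 1e3
fut.result()  # drain the queue before the next measurement

# Correct timing: block until the work finishes, the analogue of
# calling torch.cuda.synchronize() before reading the clock.
t0 = time.perf_counter()
stream.submit(fake_kernel).result()
synced_ms = (time.perf_counter() - t0) * 1e3

print(f"naive:  {naive_ms:.3f} ms")   # a tiny fraction of the real cost
print(f"synced: {synced_ms:.3f} ms")  # at least the 50 ms of 'kernel' work
```

On real hardware the same pattern applies: either call torch.cuda.synchronize() before reading the clock, or use CUDA events, which time on the device itself.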
3. Latency vs. Throughput
To overcome the paradox, you must provide enough work to "hide" the launch latency. This is the transition from a latency-bound regime (limited by the CPU-GPU bus) to a throughput-bound regime (limited by GPU memory or compute).
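The transition between regimes falls out of a simple cost model: total time is a fixed launch cost plus a data-movement cost that scales with N. The launch overhead (10 µs) and bandwidth (1 TB/s) below are assumed round numbers, not measurements, but the shape of the curve is the point: effective throughput approaches peak only once N makes the fixed cost negligible.

```python
# Hypothetical cost model with assumed constants (illustrative only).
T_LAUNCH = 10e-6      # 10 µs fixed launch overhead per kernel
BANDWIDTH = 1e12      # 1 TB/s effective memory bandwidth

def kernel_time(n_bytes):
    """Total time = fixed launch cost + data-movement cost."""
    return T_LAUNCH + n_bytes / BANDWIDTH

def effective_bandwidth(n_bytes):
    """Throughput actually achieved once the launch tax is paid."""
    return n_bytes / kernel_time(n_bytes)

for n in (1e3, 1e6, 1e9):
    frac = effective_bandwidth(n) / BANDWIDTH
    print(f"{int(n):>12} bytes -> {frac:6.1%} of peak bandwidth")
```

Under these assumptions, a 1 KB transfer achieves a negligible fraction of peak (latency-bound), while a 1 GB transfer achieves ~99% (throughput-bound): the launch latency has been hidden by sheer volume of work.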